Image editing and interaction tools for visual expression
Digital photography is becoming extremely common in our daily life. However, images remain difficult to edit and interact with, and from a user's perspective it is important to be able to interact freely with the images on a smartphone or iPad. In this thesis we develop several image editing and interaction systems with this idea in mind. We aim to create visual models with pre-computed internal structures so that interaction is readily supported. We demonstrate that such interactable models, driven by a user's hand, can deliver powerful visual expressiveness and make static pixel arrays much more fun to play with.
The first system harnesses the editing power of vector graphics. We convert raster images into a vector representation based on Loop's subdivision surfaces: an image is represented by a multi-resolution, feature-preserving sparse control mesh, with which editing can be done at a semantic level. A user can easily put a smile on a face image, or adjust the level of scene abstractness through a simple slider. The second system allows one to insert an object from one image into a new scene. The key is to correct the shading on the object so that it is consistent with the scene. Unlike traditional approaches, we use a simple shape to capture gross shading effects and a set of shading detail images to account for visual complexities. The high-frequency nature of these detail images allows a moderate range of interactive composition effects without causing alarming visual artifacts. The third system operates on video clips instead of a single image. We propose a fully automated algorithm to create
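The shading-correction idea in the second system can be illustrated with a minimal sketch: split the object's appearance into a smooth gross-shading layer and a high-frequency detail layer, then swap in new gross shading for the target scene while keeping the detail layer. This is only an illustrative approximation, assuming a multiplicative shading model and using a Gaussian blur as a stand-in for shading computed from a simple shape; the function and parameter names are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def decompose_shading(obj_luminance, sigma=15.0, eps=1e-4):
    """Split an object's luminance into a smooth gross-shading layer and a
    high-frequency shading-detail layer (multiplicative model, illustrative only)."""
    gross = gaussian_filter(obj_luminance, sigma)     # stand-in for shading from a simple shape
    detail = obj_luminance / np.maximum(gross, eps)   # high-frequency residual
    return gross, detail

def recompose(new_gross, detail):
    """Re-light the object for a new scene: replace the gross shading while
    keeping the detail layer to preserve visual complexity."""
    return np.clip(new_gross * detail, 0.0, 1.0)
```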
Learning Structured Inference Neural Networks with Label Relations
Images of scenes have various objects as well as abundant attributes, and
diverse levels of visual categorization are possible. A natural image can be
assigned fine-grained labels that describe its major components,
coarse-grained labels that depict high-level abstraction, or a set of labels
that reveal attributes. Such categorization at different concept layers can be
modeled with label graphs encoding label information. In this paper, we exploit
this rich information with a state-of-the-art deep learning framework and propose
a generic structured model that leverages diverse label relations to improve
image classification performance. Our approach employs a novel stacked label
prediction neural network, capturing both inter-level and intra-level label
semantics. We evaluate our method on benchmark image datasets, and empirical
results illustrate the efficacy of our model.
Comment: Conference on Computer Vision and Pattern Recognition (CVPR) 201
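As an illustration of how label relations at different concept levels could be injected into a prediction network, here is a minimal sketch of a stacked label-refinement layer. It is not the authors' exact SINN architecture; the module, parameter, and variable names are hypothetical, and the relation matrices are simply learned dense matrices standing in for a label graph.

```python
import torch
import torch.nn as nn

class LabelRelationLayer(nn.Module):
    """Illustrative sketch: refine per-level label scores by propagating
    messages along inter-level (coarse -> fine) and intra-level relations."""
    def __init__(self, n_coarse, n_fine):
        super().__init__()
        self.inter = nn.Parameter(torch.zeros(n_coarse, n_fine))   # coarse-to-fine relations
        self.intra_fine = nn.Parameter(torch.zeros(n_fine, n_fine))  # fine-label co-occurrence

    def forward(self, coarse_logits, fine_logits):
        msg_down = torch.sigmoid(coarse_logits) @ self.inter       # message from coarse level
        msg_intra = torch.sigmoid(fine_logits) @ self.intra_fine   # message within fine level
        refined_fine = fine_logits + msg_down + msg_intra
        return coarse_logits, refined_fine

# usage: stack such a layer on top of per-level CNN label scores
layer = LabelRelationLayer(n_coarse=10, n_fine=80)
coarse, fine = layer(torch.randn(4, 10), torch.randn(4, 80))
```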
Learning hierarchical poselets for human parsing
We consider the problem of human parsing with part-based models. Most previous work in part-based models only considers rigid parts (e.g. torso, head, half limbs) guided by human anatomy. We argue that this representation of parts is not necessarily appropriate for human parsing. In this paper, we introduce hierarchical poselets – a new representation for human parsing. Hierarchical poselets can be rigid parts, but they can also be parts that cover large portions of human bodies (e.g. torso + left arm). In the extreme case, they can be the whole bodies. We develop a structured model to organize poselets in a hierarchical way and learn the model parameters in a max-margin framework. We demonstrate the superior performance of our proposed approach on two datasets with aggressive pose variations.
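To make the part-based scoring concrete, the following sketch shows a generic hierarchical part-model score: a linear appearance term per poselet plus parent-child consistency terms over the hierarchy. It is illustrative only, not the paper's exact formulation; all function and variable names are hypothetical, and in the paper the weights would be learned in a max-margin framework.

```python
import numpy as np

def pose_score(features, config, weights, tree_edges, pairwise_w):
    """Illustrative hierarchical part score: appearance terms for each poselet
    plus parent-child spatial-consistency terms (not the authors' exact model)."""
    score = 0.0
    # appearance term: one linear template per poselet at its hypothesized location
    for p, location in config.items():
        score += weights[p] @ features(p, location)
    # pairwise term: score the relative placement of parent/child poselets
    for parent, child in tree_edges:
        offset = np.asarray(config[child], dtype=float) - np.asarray(config[parent], dtype=float)
        score += pairwise_w[(parent, child)] @ np.concatenate([offset, offset ** 2])
    return score
```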
Improved fuzzy neural network control for the clamping force of Camellia fruit picking manipulator
During the operation of the vibrating mechanism, the push-shaking camellia fruit picking manipulator needs to maintain a constant output force from the clamping hydraulic motor, so that the camellia fruit tree trunk is neither loosened nor damaged during picking, which could affect its later growth. To this end, this paper derives the state-space model of the valve-controlled clamping hydraulic motor system of the push-shaking camellia fruit picking manipulator, designs a fuzzy wavelet neural network (FWNN) controller on the basis of the traditional incremental PID control principle, and optimizes the network parameters with an improved grey wolf optimizer (GWO). The control system was then simulated in MATLAB/Simulink, both with and without external disturbance, and compared with a traditional PID controller and a fuzzy PID (FPID) controller. The results show that the traditional PID and FPID controllers respond slowly and lack robustness, while the improved fuzzy wavelet neural network PID (IFWNN PID) controller responds quickly and is strongly robust, which meets the requirement of a constant clamping force for the hydraulic motor. Finally, a field clamping test was carried out on the picking manipulator. Compared with the PID controller, the manipulator controlled by the IFWNN PID controller shortens the clamping time by 20.0% and reduces clamping damage by 13.6%, verifying that the designed controller can meet the clamping operation requirements of the camellia fruit picking machine.
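Since the controller is built on the incremental PID principle, a minimal sketch of the incremental (velocity-form) PID update may help. In the paper's scheme the three gains would be adapted online by the fuzzy wavelet neural network (itself tuned by the improved GWO); here they are fixed constants purely for illustration, and all names are hypothetical.

```python
class IncrementalPID:
    """Incremental (velocity-form) PID: outputs a control increment rather
    than an absolute control value. Gains are fixed here for illustration;
    in the paper's scheme they would be adapted online by the FWNN."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.e1 = 0.0  # error at step k-1
        self.e2 = 0.0  # error at step k-2

    def step(self, setpoint, measurement):
        e = setpoint - measurement
        du = (self.kp * (e - self.e1)
              + self.ki * e
              + self.kd * (e - 2 * self.e1 + self.e2))
        self.e2, self.e1 = self.e1, e
        return du  # increment to add to the previous control signal

# usage: accumulate the increment each control cycle
pid = IncrementalPID(kp=1.2, ki=0.3, kd=0.05)
u, force_ref, force_measured = 0.0, 100.0, 92.5   # hypothetical values
u += pid.step(force_ref, force_measured)
```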
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
The rapid evolution of Multi-modality Large Language Models (MLLMs) has
catalyzed a shift in computer vision from specialized models to general-purpose
foundation models. Nevertheless, there is still an inadequacy in assessing the
abilities of MLLMs on low-level visual perception and understanding. To address
this gap, we present Q-Bench, a holistic benchmark crafted to systematically
evaluate potential abilities of MLLMs on three realms: low-level visual
perception, low-level visual description, and overall visual quality
assessment. a) To evaluate the low-level perception ability, we construct the
LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped
with a human-asked question focusing on its low-level attributes. We then
measure the correctness of MLLMs on answering these questions. b) To examine
the description ability of MLLMs on low-level information, we propose the
LLDescribe dataset consisting of long expert-labelled golden low-level text
descriptions on 499 images, and a GPT-involved comparison pipeline between
outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we
further measure their visual quality assessment ability to align with human
opinion scores. Specifically, we design a softmax-based strategy that enables
MLLMs to predict quantifiable quality scores, and evaluate them on various
existing image quality assessment (IQA) datasets. Our evaluation across the
three abilities confirms that MLLMs possess preliminary low-level visual
skills. However, these skills are still unstable and relatively imprecise,
indicating the need for specific enhancements on MLLMs towards these abilities.
We hope that our benchmark can encourage the research community to delve deeper
to discover and enhance these untapped potentials of MLLMs. Project Page:
https://vqassessment.github.io/Q-Bench
Comment: 25 pages, 14 figures, 9 tables, preprint version
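The softmax-based scoring strategy can be sketched as follows: at the answer position of a quality-related prompt, compare the logits the MLLM assigns to a positive and a negative quality word, and take the softmax probability of the positive word as a quantifiable score. This is a hedged sketch assuming a generic Hugging Face-style tokenizer and a next-token logit vector; the prompt wording and the choice of the words "good"/"poor" are assumptions, not necessarily the benchmark's exact settings.

```python
import torch

def softmax_quality_score(logits, tokenizer, pos_word=" good", neg_word=" poor"):
    """Illustrative softmax-based quality score in [0, 1].
    `logits` is the next-token logit vector produced after a prompt such as
    "Rate the quality of this image. The quality of the image is"."""
    pos_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(pos_word))[0]
    neg_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(neg_word))[0]
    pair = torch.stack([logits[pos_id], logits[neg_id]])
    return torch.softmax(pair, dim=0)[0].item()  # probability mass on the positive word
```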
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Multi-modality foundation models, as represented by GPT-4V, have brought a
new paradigm for low-level visual perception and understanding tasks: a single
model that can respond to a broad range of natural human instructions. While
existing foundation models have shown exciting potentials on low-level visual
tasks, their related abilities are still preliminary and need to be improved.
To enhance these models, we conduct a large-scale subjective experiment
collecting a vast amount of real human feedback on low-level vision. Each
feedback item follows a pathway that starts with a detailed description of the
low-level visual appearance (*e.g.*, clarity, color, brightness) of an image
and ends with an overall conclusion, with an average length of 45 words.
The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on
18,973 images with diverse low-level appearance. Moreover, to enable foundation
models to robustly respond to diverse types of questions, we design a
GPT-participated conversion to process these feedbacks into 200K diverse-format
instruction-response pairs. Experimental results indicate that **Q-Instruct**
consistently elevates low-level perception and understanding abilities across
several foundation models. We anticipate that our datasets
can pave the way for a future in which general intelligence can perceive and
understand low-level visual appearance and evaluate visual quality like a
human. Our dataset, model zoo, and demo are published at:
https://q-future.github.io/Q-Instruct
Comment: 16 pages, 11 figures, pages 12-16 as appendix
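As a purely illustrative sketch of how one pathway feedback might be turned into instruction-response pairs (the paper's actual conversion is GPT-participated and produces far more diverse formats), consider the following. Field names, instruction strings, and the example record are all hypothetical.

```python
import json

def to_instruction_pairs(feedback: dict) -> list[dict]:
    """Illustrative conversion of one pathway feedback into two simple
    instruction-response pairs; field names are hypothetical."""
    return [
        {"image": feedback["image"],
         "instruction": "Describe the low-level visual appearance of this image.",
         "response": feedback["description"]},
        {"image": feedback["image"],
         "instruction": "Give an overall conclusion on the visual quality of this image.",
         "response": feedback["conclusion"]},
    ]

# usage with a hypothetical feedback record
example = {"image": "0001.jpg",
           "description": "The image is sharp, with natural colors and good brightness.",
           "conclusion": "Overall, the quality is good."}
print(json.dumps(to_instruction_pairs(example), indent=2))
```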